Sentiment analysis describes the emotions expressed about a given topic: anything from movie reviews and tweets to opinions a particular person posts on social media. It is a powerful tool for analysing current and future trends and opinions. In this report we work with a sample of IMDb movie reviews. We begin with data exploration, calculating several ratios to look for dependencies between variables (in our case, individual words) and checking whether some words are strictly associated with negative or positive sentiment. To study word-occurrence counts we use Zipf's law. We then investigate which data pre-processing approach works best for logistic regression and naive Bayes. Finally, we compare several common ML models in terms of test-set accuracy and evaluation time.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import string
import re
from nltk.corpus import stopwords
ps = nltk.PorterStemmer()
stopword = nltk.corpus.stopwords.words('english')
def clean_text(text):
    text_lc = "".join([ch.lower() for ch in text if ch not in string.punctuation])  # remove punctuation
    text_rc = re.sub('[0-9]+', '', text_lc)  # remove digits
    tokens = re.split(r'\W+', text_rc)  # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and stem
    return text
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
"haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
"wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
"can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
"mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')
def cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # get_text() already returns str in Python 3, so no decoding is needed;
    # just replace any Unicode replacement characters left by a broken BOM
    bom_removed = souped.replace(u"\ufffd", "?")
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()
df = pd.read_csv("IMDB_sample.csv")
clean_texts = []
for i in range(df.shape[0]):
    clean_texts.append(cleaner(df['review'][i]))
# build the working frame used below; we assume the label column of the sample is named 'label'
clean_df = pd.DataFrame({'text': clean_texts, 'target': df['label']})
clean_df["text2"] = df['review'].apply(lambda x: clean_text(x))
from wordcloud import WordCloud
Word clouds are great as a decoration or headline in presentations, but they don't give us much information in the analysis. Their main idea is to show which words were used mainly in a negative or positive context: the bigger a word appears in the picture, the higher its frequency in the data.
neg_rev = clean_df[clean_df.target == 0]
neg_string = []
for t in neg_rev.text:
    neg_string.append(t)
neg_string = pd.Series(neg_string).str.cat(sep=' ')
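The word cloud is driven by raw word counts; the same frequencies can be computed directly with `collections.Counter`. A minimal sketch on toy text (the real call would use the `neg_string` built above):

```python
from collections import Counter

# toy negative-review text standing in for neg_string
neg_string_toy = "bad movie bad plot boring movie bad acting"
freq = Counter(neg_string_toy.split())
print(freq.most_common(2))  # → [('bad', 3), ('movie', 2)]
```

These counts are exactly what determines each word's size in the cloud.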
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer()
countVectorizer2 = CountVectorizer(analyzer = clean_text)
countVector = countVectorizer.fit_transform(clean_df['text'])
countVector2 = countVectorizer2.fit_transform(clean_df['text'])
The difference in the shapes of the vectorizer matrices is due to the different text cleaning:
[countVector.shape,countVector2.shape]
from IPython.core.display import HTML
def multi_table(table_list):
    '''Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell.'''
    return HTML(
        '<table><tr style="background-color:white;">' +
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +
        '</tr></table>'
    )
matrix = countVector.toarray()
neg_matrix = matrix[clean_df.target == 0]
pos_matrix = matrix[clean_df.target == 1]
neg_tf = np.sum(neg_matrix,axis=0)
pos_tf = np.sum(pos_matrix,axis=0)
neg = np.squeeze(np.asarray(neg_tf))
pos = np.squeeze(np.asarray(pos_tf))
term_freq_df = pd.DataFrame([neg,pos],columns=countVectorizer.get_feature_names()).transpose()
For comparison, we considered two CountVectorizer approaches: the first on lightly cleaned data, the second with tokenization, stopword removal and punctuation stripping. We printed the 10 most commonly used words for each method. At first glance, almost every word has a similar number of negative and positive occurrences, but in the 'non-cleaned' data these words carry no information useful for sentiment analysis.
The plots confirm what we noticed above: the negative frequency of a word is almost the same as its positive one, especially when the data is not cleaned. Most words fall below 10000 occurrences in the first plot and below 2000 in the second. The second plot shows more points that occur distinctly more often as a positive or negative word, so the situation changes for the better when we clean the data.
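Extracting the most common words per class boils down to sorting one column of the term-frequency frame. A minimal sketch on invented counts shaped like `term_freq_df` (rows are tokens, columns are per-class totals):

```python
import pandas as pd

# toy per-class term counts; the real frame is term_freq_df built above
toy_tf = pd.DataFrame({"negative": [120, 95, 400, 380],
                       "positive": [30, 20, 410, 395]},
                      index=["bad", "boring", "movie", "film"])
# the 2 most common words in the negative class
top_neg = toy_tf["negative"].nlargest(2)
print(top_neg.index.tolist())  # → ['movie', 'film']
```

Note how the top negative words ('movie', 'film') are also frequent positive words, which is exactly the effect discussed above.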
From the summary we can note that the R-squared statistic is almost 1, which means our data lies very close to the fitted regression line.
Similar results: R-squared is lower than before, but still high.
Zipf's law concerns the occurrence counts of words in written or spoken language. It states that a word's frequency relative to the most frequent word is approximately 1/n, where n is the word's rank by usage frequency: the second most common word occurs about half as often as the first, the third one third as often, and so on. In the part below we show that this law also holds (at least approximately) for our data set.
Red dashed lines represent the exact value of Zipf's function; the blue bars are the frequencies of the words that occur most often in the data set.
Taking the log scale for the frequencies gives us a nearly straight line. To make it more informative, we added the terms that occur in each segment.
The same for the more strictly cleaned data.
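The reference curve in the plots above follows directly from the law. A short sketch (the top-word count `f1` is an assumed value for illustration):

```python
import numpy as np

# Zipf's law: the word of rank k occurs roughly f1 / k times,
# where f1 is the count of the most frequent word
f1 = 60000
ranks = np.arange(1, 6)
expected = f1 / ranks
print(expected)  # → [60000. 30000. 20000. 15000. 12000.]
# on a log-log scale, log(expected) = log(f1) - log(k): a straight line with slope -1
```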
In this section we calculate several rates to look for dependencies between words and sentiment.
from scipy.stats import hmean
from scipy.stats import norm
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())
Plot of the harmonic mean of rate CDF and frequency CDF (for less cleaned data).
If a point is closer to the upper left corner, it is more positive, and if it is closer to the bottom right corner, it is more negative.
Plot of the harmonic mean of rate CDF and frequency CDF (for cleaned data).
In both cases, it has created an interesting, almost symmetrical shape.
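The plotted score is the harmonic mean of two CDF-transformed quantities per token: its positive rate (share of its occurrences that are positive) and its positive frequency share. A minimal sketch on invented counts (column names follow `term_freq_df`; the numbers are toy values):

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, hmean

# toy per-class counts; the notebook computes the same score on term_freq_df
toy = pd.DataFrame({"negative": [5, 40, 200], "positive": [50, 45, 210]},
                   index=["great", "plot", "movie"])
toy["pos_rate"] = toy["positive"] / (toy["positive"] + toy["negative"])
toy["pos_freq_pct"] = toy["positive"] / toy["positive"].sum()

def _normcdf(x):
    return norm.cdf(x, x.mean(), x.std())

# the harmonic mean rewards tokens that score high on BOTH rate and frequency
toy["pos_normcdf_hmean"] = hmean([_normcdf(toy["pos_rate"]),
                                  _normcdf(toy["pos_freq_pct"])])
print(toy["pos_normcdf_hmean"])
```

The harmonic mean is used (instead of the arithmetic one) because it stays low unless both components are high, so rare-but-skewed and frequent-but-neutral tokens are both pushed toward the middle.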
from bokeh.plotting import figure,output_file,show
from bokeh.io import output_notebook, show
from bokeh.models import LinearColorMapper
from bokeh.models import HoverTool
color_mapper = LinearColorMapper(palette='Inferno256', low=min(term_freq_df2.pos_normcdf_hmean), high=max(term_freq_df2.pos_normcdf_hmean))  # map color over the same frame used as the plot source
p = figure(x_axis_label='neg_normcdf_hmean', y_axis_label='pos_normcdf_hmean')
p.circle('neg_normcdf_hmean','pos_normcdf_hmean',size=5,alpha=0.3,source=term_freq_df2,color={'field': 'pos_normcdf_hmean', 'transform': color_mapper})
hover = HoverTool(tooltips=[('token','@index')])
p.add_tools(hover)
show(p)
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time
from sklearn.model_selection import train_test_split
We divided our dataset into train and test sets (in an 80:20 proportion). Both sets contain about 50% negative and 50% positive reviews.
x = clean_df.text
y = clean_df.target
SEED = 2020
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=SEED)
print(f"Train set has {len(x_train)} entries, of which {len(x_train[y_train == 0]) / len(x_train) * 100:.1f}% are negative "
      f"and {len(x_train[y_train == 1]) / len(x_train) * 100:.1f}% positive\n")
print(f"Test set has {len(x_test)} entries, of which {len(x_test[y_test == 0]) / len(x_test) * 100:.1f}% are negative "
      f"and {len(x_test[y_test == 1]) / len(x_test) * 100:.1f}% positive\n")
def accuracy(pipeline, x_train, y_train, x_test, y_test):
    t0 = time()
    sentiment_fit = pipeline.fit(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    train_test_time = time() - t0
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy on test data {accuracy} \n")
    print(f"train and test time {train_test_time}")
    print("-" * 85)
    return accuracy, train_test_time
countVectorizer = CountVectorizer()
countVectorizer2 = CountVectorizer(analyzer = clean_text)
lr = LogisticRegression()
n_features = np.arange(5000,30001,2500)
def nfeature_accuracy_checker(vectorizer=countVectorizer, n_features=n_features, stop_words=None,
                              ngram_range=(1, 1), classifier=lr, analyzer='word'):
    result = []
    print(classifier, '\n')
    for n in n_features:
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range, analyzer=analyzer)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        nfeature_accuracy, tt_time = accuracy(checker_pipeline, x_train, y_train, x_test, y_test)
        result.append((n, nfeature_accuracy, tt_time))
    return result
We compared the test-set accuracy for three variants of the data: with stop words, without stop words, and fully cleaned. The plot clearly shows that the data without stop words gives the highest accuracy. Moreover, there is almost no difference between using 25000 and 30000 features.
For NB, the accuracy is highest for the fully cleaned data. The results for the data with stop words are clearly worse, but the difference between the cleaned data and the data merely stripped of stop words is not significant. Thus, in the further steps we won't use full cleaning: it isn't worthwhile.
It's clearly visible that there are no big differences between the methods.
This time the unigram model performs distinctly worse, while all the other methods look almost the same.
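To illustrate what changing `ngram_range` does, here is a minimal sketch on two toy sentences; with bigrams, phrases like "not good" enter the vocabulary and can capture negation that unigrams miss:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "very good movie"]
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(unigrams.vocabulary_))  # 6 distinct unigrams
print(len(uni_bi.vocabulary_))    # 6 unigrams + 5 bigrams such as 'not good'
```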
from sklearn.feature_extraction.text import TfidfVectorizer
tfdf = TfidfVectorizer()
feature_tune_tf_ug_pd = pd.DataFrame(feature_tune_tf_ug,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_tg_pd = pd.DataFrame(feature_tune_tf_tg ,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_sg_pd = pd.DataFrame(feature_tune_tf_sg ,columns=['nfeatures','validation_accuracy','train_test_time'])
The dotted plots correspond to TF-IDF and the line plots to CountVectorizer. TF-IDF clearly works better on our dataset. Once again the unigram model gives the worst results, whereas the 3-gram and 6-gram models work best.
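The advantage of TF-IDF over raw counts is that ubiquitous words are down-weighted. A toy sketch of the `idf_` weights learned by `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good acting"]
tfidf = TfidfVectorizer().fit(docs)
vocab = tfidf.vocabulary_
# 'movie' appears in 2 of 3 documents, 'bad' in only 1,
# so 'bad' receives a larger idf weight than 'movie'
print(tfidf.idf_[vocab["bad"]] > tfidf.idf_[vocab["movie"]])  # → True
```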
feature_tune_tf_ug_nb_pd = pd.DataFrame(feature_tune_tf_ug_nb,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_tg_nb_pd = pd.DataFrame(feature_tune_tf_tg_nb ,columns=['nfeatures','validation_accuracy','train_test_time'])
feature_tune_tf_sg_nb_pd = pd.DataFrame(feature_tune_tf_sg_nb ,columns=['nfeatures','validation_accuracy','train_test_time'])
For the NB classifier, unigrams look definitely worse in both cases, and all the other methods work almost the same. Taking this plot and the previous one into account, we will only consider 3-gram TF-IDF in the last step of our sentiment analysis, where we compare different classification methods.
Because logistic regression is fully interpretable, we took a look at the words with the highest (most positive) and lowest (most negative) coefficients in the model.
logistic_regression = LogisticRegression()
# X below is the vectorized (TF-IDF) training matrix produced by the fitted vectorizer
logistic_regression = logistic_regression.fit(X, y_train)
words = vectorizer.get_feature_names()
lr_beta = np.ravel(logistic_regression.coef_)
Most of the terms clustered in a given sentiment group make sense: the ones the model treats as negative would be classified the same way by a human (in this case, by us), and likewise for the positive ones. This means our regression is sensible and will probably work properly on other texts.
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import RidgeClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import NearestCentroid

names = ["Logistic Regression", "Linear SVC", "LinearSVC with L1-based feature selection", "Multinomial NB",
         "Bernoulli NB", "Ridge Classifier", "AdaBoost", "Perceptron", "Passive-Aggressive", "Nearest Centroid"]
classifiers = [
    LogisticRegression(),
    LinearSVC(),
    Pipeline([
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        ('classification', LinearSVC(penalty="l2"))]),
    MultinomialNB(),
    BernoulliNB(),
    RidgeClassifier(),
    AdaBoostClassifier(),
    Perceptron(),
    PassiveAggressiveClassifier(),
    NearestCentroid()
]
zipped_clf = zip(names,classifiers)
tvec = TfidfVectorizer()
def cls_compare(vectorizer=tvec, n_features=30000, stop_words=None, ngram_range=(1, 1), classifier=zipped_clf):
    result = []
    vectorizer.set_params(stop_words=stop_words, max_features=n_features, ngram_range=ngram_range)
    for n, c in classifier:
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', c)
        ])
        print(f"Validation result for {n}")
        print(c)
        clf_accuracy, tt_time = accuracy(checker_pipeline, x_train, y_train, x_test, y_test)
        result.append((n, clf_accuracy, tt_time))
    return result
cls_outcome = cls_compare(stop_words = 'english',ngram_range=(1, 3))
cls_imdb_score = pd.DataFrame(cls_outcome,columns = ['model','test accuracy','time'])
To draw a somewhat broader conclusion, we tried a number of models and compared their accuracy and evaluation time. Most of the models reach accuracy close to or above 85%, which is a pleasant surprise, since NLP without neural networks is often considered hopeless. Here, standard models give decent scores in a very short time. Again, logistic regression appears to be one of the best choices: while the SVC performs slightly better, it is not so easy to find the most negative or positive terms for that classifier, whereas for logistic regression we did exactly that above. It is good to know that the hours spent at the mathematics faculty are not wasted and that we can build a simple, fully interpretable model that meets the requirements of the given task.
import pytreebank
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, accuracy_score
from sklearn.metrics import confusion_matrix
To make things more challenging, we also tried to work on a data set provided by Stanford University. The hard part of this analysis is that there are 5 classes, where 1 means very negative, 3 neutral and 5 very positive. We wanted to find out whether a basic model can still achieve a decent outcome on this more sophisticated data. To interpret whether the models match particular sentiments, we used a confusion matrix and evaluated each model according to its outcome.
def accuracy_clf(y_pred, y_true):
    """Prediction accuracy (percentage) and macro F1 score."""
    acc = accuracy_score(y_true, y_pred) * 100
    f1 = f1_score(y_true, y_pred, average='macro')
    print("Accuracy: {}\nMacro F1-score: {}".format(acc, f1))
def plot_confusion_matrix(y_true, y_pred,
                          classes=[1, 2, 3, 4, 5],
                          normalize=False,
                          cmap=plt.cm.YlOrBr):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    (Adapted from scikit-learn docs.)
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', origin='lower', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # Show all ticks
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # Label with respective list entries
           xticklabels=classes, yticklabels=classes,
           ylabel='True label',
           xlabel='Predicted label')
    # Set alignment of tick labels
    plt.setp(ax.get_xticklabels(), rotation=0, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    return fig, ax
Neither the training nor the test set is perfectly balanced in terms of class occurrences, but this shouldn't affect the models much in terms of accuracy.
df_test = pd.read_csv('./sst_test.txt', sep='\t', header=None, names=['truth', 'text'])
df_test['truth'] = df_test['truth'].str.replace('__label__', '')
df_test['truth'] = df_test['truth'].astype(int).astype('category')
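The class-balance claim above can be checked with `value_counts`. A toy sketch (the real check would run on `df_test['truth']` and the corresponding training labels):

```python
import pandas as pd

# toy 5-class labels standing in for df_test['truth']
labels = pd.Series([4, 2, 3, 4, 5, 1, 4, 2, 3, 3])
shares = labels.value_counts(normalize=True).sort_index()
print(shares)  # class shares; here class 4 is over-represented, classes 1 and 5 under-represented
```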
from textblob import TextBlob
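The notebook doesn't show how the continuous polarity score from TextBlob (or VADER's compound score), which lies in [-1, 1], was mapped onto the five SST classes. One plausible scheme, purely an assumption here rather than the mapping actually used, is equal-width binning:

```python
def polarity_to_class(p):
    """Map a polarity score in [-1, 1] to an SST class 1..5 via equal-width bins (assumed scheme)."""
    if p < -0.6:
        return 1   # very negative
    elif p < -0.2:
        return 2   # negative
    elif p <= 0.2:
        return 3   # neutral
    elif p <= 0.6:
        return 4   # positive
    else:
        return 5   # very positive

print([polarity_to_class(p) for p in (-0.9, -0.3, 0.0, 0.4, 0.95)])  # → [1, 2, 3, 4, 5]
```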
accuracy_clf(df.pred_blob,df.truth)
accuracy_clf(df.pred_vader,df.truth)
accuracy_clf(df_test.pred_lr,df_test.truth)
accuracy_clf(df_test.pred_svm,df_test.truth)
accuracy_clf(df_test.pred_nb,df_test.truth)
The accuracy achieved by the first model (a Recursive Neural Tensor Network created specifically for this data) was about 45.5%. Logistic regression, which is far simpler, gave about 40%, and NB 39.7%. These models are much easier to interpret, so it is up to us whether we prefer better accuracy or better explainability. All models struggle with classes that are close to each other, i.e. {1,2} or {4,5}: there is a lot of misclassification in those cases. The neutral class (3) is almost always skipped and classified as 2 or 4. It is hard even for a human to tell whether a text is ironic or simply neutral, so for these classifiers the right decision is out of reach. It is worth mentioning that some models are very accurate in special cases, e.g. NB is excellent at finding the very negative and very positive comments.